fix(queue): treat 200 + non-REVALIDATED response as success#1272
fix(queue): treat 200 + non-REVALIDATED response as success#1272alex-all3dp wants to merge 1 commit into
Conversation
The DO queue handler previously threw FatalError when the HEAD self-fetch returned 200 with an x-nextjs-cache value other than REVALIDATED. This commonly happens under load when in-isolate stale-while-revalidate has already regenerated the page before the queued HEAD reaches the ISR handler — the page is fresh in cache but Next.js returns HIT/STALE rather than REVALIDATED. Fall through to the success path (sync table updated, failed state cleared) and log a debug message instead.
commit: |
conico974
left a comment
There was a problem hiding this comment.
@alex-all3dp Is it for PPR only ?
| ); | ||
| if (response.status === 200) { | ||
| const cacheStatus = response.headers.get("x-nextjs-cache"); | ||
| if (cacheStatus !== "REVALIDATED") { |
There was a problem hiding this comment.
We shouldn't treat STALE as a success that's for sure.
But even HIT should never happen (maybe in PPR, haven't tested that), we are supposed to remove the in isolate revalidation, if not it means there is a bug that we need to fix
There was a problem hiding this comment.
Thanks for the review @conico974
Not PPR-only, we hit it on regular ISR routes (dynamic = 'error' revalidate from 1 week to 1 year), no cacheComponents/PPR enabled. Our open-next.config.ts is
const openNextConfig: OpenNextConfig = {
...defineCloudflareConfig({
incrementalCache: withRegionalCache(r2IncrementalCache, {
mode: 'long-lived',
}),
queue: queueCache(doQueue, {regionalCacheTtlSec: 30, waitForQueueAck: false}),
tagCache: doShardedTagCache({
baseShardSize: 4,
regionalCache: true,
}),
cachePurge: purgeCache({type: 'durableObject'}),
}),
dangerous: {
middlewareHeadersOverrideNextConfigHeaders: true,
},
cloudflare: {
skewProtection: {
enabled: true,
maxNumberOfVersions: 10,
maxVersionAgeDays: 20
},
},
}You're right on both points though. If in-isolate SWR-triggered regen shouldn't exist alongside the queue's HEAD regen, then this PR would just paper over an underlying bug it seems.
How do ypu prefer to handle this PR? Closing it would be reasonable. Alternatively narrow it to handle HIT only, drop STALE, add a guard rail ((e.g. only treat HIT as success when the cache freshness is verifiable) and mark the change as a defense for potential race conditions?
Problem
DOQueueHandler.executeRevalidationthrowsFatalError("…cannot be done. This error should never happen.")whenever the HEAD self-fetch returns200with anx-nextjs-cachevalue other thanREVALIDATED. In real-world deployments serving meaningful traffic this happens routinely: when a user hits a stale page, the worker's in-isolate stale-while-revalidate regenerates the page in parallel with the queued revalidation. By the time the DO's HEAD self-fetch reaches the ISR handler, the page is already fresh in the incremental cache and Next.js returnsHIT(orSTALE) rather thanREVALIDATED.This produces noisy
FatalErrorlogs on healthy, correctly-serving pages. The desired outcome — fresh content in cache — has been achieved; the request just took a different path.Reproduces under load with any throttling that delays the queued HEAD relative to the in-isolate SWR. Reported in #662; seen in production on opennextjs-cloudflare 1.19.9, Next.js 16.2.6 across deeply nested ISR routes (
dynamic = 'error',revalidatein the 1-week–1-year range).Fix
In
packages/cloudflare/src/api/durable-objects/queue.ts, fall through to the existing success path whenstatus === 200regardless of thex-nextjs-cachevalue:INSERT OR REPLACE INTO sync …).routeInFailedState.debuglog records the non-REVALIDATED status for observability.The
FatalErrorimport is dropped (no other usage).Test changes
The existing test
"should not put it in failed state for an incorrect 200"(underfailed revalidation) is replaced by"should treat 200 + non-REVALIDATED response as success (page already regenerated by another path)"undersuccessful revalidation, which now also asserts the sync-table write.All 22 tests in
queue.spec.tspass; full suite (328 tests) passes;pnpm code:checksclean.Risk
Low. Behavior changes only for the previously-fatal branch. 200+REVALIDATED, 404, 500, and other non-200 responses are unchanged.
Closes #662